In this report we discuss the creation of a linear model of the
median house value of California (CA) block groups in the 1990s,
based on administrative (Census) data. Given the recent turbulence in
the real estate market and the associated bidding wars, there is
renewed interest in what determines the value of a house. Utilizing
methods from STAT 420, we performed model selection to identify the
model that minimized test RMSE, met the assumptions of linear
modelling, and had the fewest parameters to enhance explainability.
The best linear model was the full additive model, which included the
predictors longitude, latitude, housing_median_age, total_rooms,
population, median_income, and ocean_proximity. This model had a test
RMSE of 0.3407 and an adjusted \(R^2\) of 0.7597. Our model provides
utility in explaining how the features of a California block group
relate to its median house value.
For our project we leveraged the California Housing Prices data set from Kaggle. The data come from the 1990 California census and contain one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).
The data set contains summary statistics about the houses found in each block group. There are 10 columns and 20,640 rows in the data. The columns are described below:
| Column Name | Description |
|---|---|
| longitude | Longitudinal location of the block group. |
| latitude | Latitudinal location of the block group. |
| housing_median_age | Median age of the houses in the block group. |
| total_rooms | Total rooms across all houses in the block group. |
| total_bedrooms | Total bedrooms across all houses in the block group. |
| population | Total number of people in the block group. |
| households | Number of households in the block group. |
| median_income | Median income of households in the block group. |
| median_house_value | Median house value of the block group. |
| ocean_proximity | Whether the block group is near the ocean, near the Bay Area, inland, or on an island. |
In order to ensure the reader understands the caveats of working with our data, we first want to underscore the differences between census housing data and typical sources of real estate data. The census data set, unlike scraped real estate data, is gathered by an administrative body, the Census Bureau. Since the stakeholder collecting the data is not interested in selling houses, we see different features represented that relate more to geography (latitude, longitude), population density (population, number of households), and age of the buildings (housing_median_age). The data set is also aggregated at the block-group level (rather than at the house level). This means our analysis did not directly model individual house values, but the median house value of a neighborhood. We believe this helped us better understand the value of the neighborhood as opposed to the value of the amenities of individual houses.
Another important limitation of the source of this data is the date range. Since the data were gathered as part of the 1990 Census, we cannot extrapolate to current housing prices or use the model to explain which features drive median house value in any other time period. For this reason, we focused our project on explanatory modelling rather than prediction. While there are few present-day use cases for an algorithm predicting the median house value of a 1990s census block group, understanding the historical features related to a block group's price point provides an opportunity to understand how median house price varied with the available parameters at the time of data collection.
We will now discuss the steps we used to generate our linear model. We began by conducting exploratory data analysis (EDA). Based on the EDA, we identified the steps necessary to clean our data. To ensure our model was not overfit, we split the data set into a 70-30 train-test split. Next, we carried out a model selection process to minimize the test RMSE of our model. To conduct model selection and graph the results, we leveraged helper functions, which we describe before diving into the model selection.
Function : model_performance_metrics
will generate model performance metrics such as RMSE and adjusted
R-squared.
Inputs:
models : A list of models to generate the performance metrics
model_names : Names of the models passed to the function
data : The actual data which will be used to calculate predicted values and RMSE for each model
Output : Table with the following performance metrics
as columns and a row for each model provided in model_names
as input.
Adjusted R-Squared
Root Mean Square Error on Training Data
Root Mean Square Error on Test Data
Leave One Out Cross Validated RMSE
Total number of coefficients in the model
model_performance_metrics = function(models, model_names, data){
# Number of models to evaluate
n_mod = length(models)
# Initialize the data frame to be built by the function
perf_metrics_df = data.frame(
row.names = model_names,
"ADJR2" = rep(0L, n_mod),
"Train.RMSE" = rep(0L, n_mod),
"Test.RMSE" = rep(0L, n_mod),
"LOOCV_RMSE" = rep(0L, n_mod),
"Coefs" = rep(0L, n_mod)
)
# Extract actual (log-transformed) response values for computing test RMSE
y_i = log(data[["median_house_value"]])
for (idx in 1:n_mod){
perf_metrics_df[model_names[idx], "ADJR2"] = summary(models[[idx]])$adj.r.squared
perf_metrics_df[model_names[idx], "Train.RMSE"] = sqrt(mean(resid(models[[idx]]) ^ 2))
y_hat = predict(models[[idx]], newdata = data)
perf_metrics_df[model_names[idx], "Test.RMSE"] = sqrt(mean((y_i - y_hat) ^ 2))
perf_metrics_df[model_names[idx], "LOOCV_RMSE"] = sqrt(mean((resid(models[[idx]]) /
(1 - hatvalues(models[[idx]]))) ^ 2))
perf_metrics_df[model_names[idx], "Coefs"] = length(coef(models[[idx]]))
}
# Render the comparative model performance table
kable(perf_metrics_df,
col.names = c("Adjusted R-Squared",
"Train RMSE",
"Test RMSE",
"LOOCV RMSE",
"Coefficients"),
caption = "Comparative Model Performance Metrics")
}
Function : get_model_diagnostic_tests
will generate model diagnostic tests, namely the Breusch-Pagan (B-P) test and
the Shapiro-Wilk test, to check homoscedasticity and normality of the residuals.
Inputs:
models : A list of models to run the diagnostics on
model_names : Names of the models passed to the function
Output : Table with following diagnostic metrics as
columns and a row for each model provided in model_names as
input.
B-P Test P-Value
B-P Test Decision
Shapiro-Wilk Test P-Value
Shapiro-Wilk Test Decision
get_model_diagnostic_tests = function(models, model_names){
n_mod = length(models)
diagnostics_metrics_df = data.frame(
row.names = model_names,
"BP_P_Value" = rep(0.0, n_mod),
"BP_DEC" = rep("", n_mod),
"SW_P_VALUE" = rep(0.0, n_mod),
"SW_DEC" = rep("", n_mod)
)
for (idx in 1:n_mod){
bp_test_res = get_bp_decision(models[[idx]])
diagnostics_metrics_df[model_names[idx], "BP_P_Value"] = bp_test_res[1]
diagnostics_metrics_df[model_names[idx], "BP_DEC"] = bp_test_res[2]
sw_test_res = get_sw_decision(models[[idx]])
diagnostics_metrics_df[model_names[idx], "SW_P_VALUE"] = sw_test_res[1]
diagnostics_metrics_df[model_names[idx], "SW_DEC"] = sw_test_res[2]
}
# Diagnostics of Models
kable(diagnostics_metrics_df,
col.names = c("B-P Test P-Value",
"B-P Test Decision",
"Shapiro-Wilk Test P-Value",
"Shapiro-Wilk Test Decision"),
caption = "Comparative Model Diagnostics")
}
# function to get decision of Breusch-Pagan Test (for checking homoscedasticity)
get_bp_decision = function(model, alpha=0.05) {
bp_pval = bptest(model)$p.value
c(format.pval(bp_pval), ifelse(bp_pval < alpha, "Reject", "Fail to Reject"))
}
# function to get decision of Shapiro-Wilk Test (for normality of error term)
get_sw_decision = function(model, alpha=0.05) {
# shapiro.test caps input at 5000 observations; our data set has over
# 20,000 observations, so we test the first 5000 residuals
sw_pval = shapiro.test(resid(model)[1:5000])$p.value
c(format.pval(sw_pval), ifelse(sw_pval < alpha, "Reject", "Fail to Reject"))
}
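To make the helper output concrete, here is a toy illustration of the decision helpers applied to a small simulated regression. This is a sketch, not project code; it assumes the lmtest package (needed for bptest()) is loaded elsewhere in this report, and that the well-specified toy model should satisfy both assumptions.

```r
# Toy illustration (not project data): apply the decision helpers above
# to a small simulated regression. Assumes lmtest is loaded for bptest().
set.seed(42)
toy = data.frame(x = runif(200))
toy$y = 1 + 2 * toy$x + rnorm(200)
toy_model = lm(y ~ x, data = toy)
get_bp_decision(toy_model)  # formatted p-value and "Reject"/"Fail to Reject"
get_sw_decision(toy_model)  # same shape of output for the Shapiro-Wilk test
```

Note that get_sw_decision still works here even though the toy model has only 200 residuals: shapiro.test() permits missing values, so the trailing NAs produced by indexing [1:5000] are ignored.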
Function : get_model_diagnostic_plots
will generate model diagnostic plots: Q-Q plots and fitted
vs. residuals plots.
Inputs:
models : A list of models to generate the plots
Returns : None
Output : Plots showing diagnostic metrics ‘Fitted vs. Residuals’ & ‘Q-Q Plots’ for our top two models.
Fitted vs. Residuals plots to check LINE assumption - Heteroscedasticity
Q-Q Plot to check LINE assumption - normality of error term
get_model_diagnostic_plots = function(models, model_names){
n_mod = length(models)
par(mfrow=c(1,2))
for (i in 1:n_mod){
plot_fitted_resid(models[[i]], model_names[[i]])
}
par(mfrow=c(1,2))
for (i in 1:n_mod){
plot_qq(models[[i]], model_names[[i]])
}
}
# function to plot fitted vs. residuals to check homoscedasticity -
# uniform variance of the errors across the range of fitted values
plot_fitted_resid = function(model, model_name) {
plot(fitted(model), resid(model),
col = green_color, pch = 20,
xlab = "Fitted", ylab = "Residuals",
main = paste0("Fitted vs. Residual for ", model_name),
sub = model_name)
abline(h = 0, col = "darkorange", lwd = 2)
}
# function to plot qq-plot & qq-line
plot_qq = function(model, model_name) {
qqnorm(resid(model), col = green_color, pch = 20,
main = paste0("Q-Q Plot for ", model_name), sub = model_name)
qqline(resid(model), col = "darkorange", lwd = 2)
}
Our first step in cleaning the data was reading in the CSV file from kaggle.com. Our team members reviewed the first few rows to ensure that there were no encoding issues when reading the raw data into R.
#read in the data
read_in <- function(){
ca_housing_data = read.csv('../000_Data/california-housing-prices/housing.csv')
ca_housing_data
}
ca_housing_data <- read_in()
head(ca_housing_data)
## longitude latitude housing_median_age total_rooms total_bedrooms population
## 1 -122.2 37.88 41 880 129 322
## 2 -122.2 37.86 21 7099 1106 2401
## 3 -122.2 37.85 52 1467 190 496
## 4 -122.2 37.85 52 1274 235 558
## 5 -122.2 37.85 52 1627 280 565
## 6 -122.2 37.85 52 919 213 413
## households median_income median_house_value ocean_proximity
## 1 126 8.325 452600 NEAR BAY
## 2 1138 8.301 358500 NEAR BAY
## 3 177 7.257 352100 NEAR BAY
## 4 219 5.643 341300 NEAR BAY
## 5 259 3.846 342200 NEAR BAY
## 6 193 4.037 269700 NEAR BAY
The first things we looked at were basic summary statistics and
missingness. We found that 207 observations were missing
total_bedrooms. We cannot fit a linear model with
missing data. However, when we look at collinearity below, we will find a
solution that allows us to keep these observations.
summary(ca_housing_data)
## longitude latitude housing_median_age total_rooms
## Min. :-124 Min. :32.5 Min. : 1.0 Min. : 2
## 1st Qu.:-122 1st Qu.:33.9 1st Qu.:18.0 1st Qu.: 1448
## Median :-118 Median :34.3 Median :29.0 Median : 2127
## Mean :-120 Mean :35.6 Mean :28.6 Mean : 2636
## 3rd Qu.:-118 3rd Qu.:37.7 3rd Qu.:37.0 3rd Qu.: 3148
## Max. :-114 Max. :42.0 Max. :52.0 Max. :39320
##
## total_bedrooms population households median_income
## Min. : 1 Min. : 3 Min. : 1 Min. : 0.50
## 1st Qu.: 296 1st Qu.: 787 1st Qu.: 280 1st Qu.: 2.56
## Median : 435 Median : 1166 Median : 409 Median : 3.54
## Mean : 538 Mean : 1425 Mean : 500 Mean : 3.87
## 3rd Qu.: 647 3rd Qu.: 1725 3rd Qu.: 605 3rd Qu.: 4.74
## Max. :6445 Max. :35682 Max. :6082 Max. :15.00
## NA's :207
## median_house_value ocean_proximity
## Min. : 14999 Length:20640
## 1st Qu.:119600 Class :character
## Median :179700 Mode :character
## Mean :206856
## 3rd Qu.:264725
## Max. :500001
##
#How many missing values do we have?
colSums(is.na(ca_housing_data))
## longitude latitude housing_median_age total_rooms
## 0 0 0 0
## total_bedrooms population households median_income
## 207 0 0 0
## median_house_value ocean_proximity
## 0 0
As part of data cleaning, we plotted
median_house_value against latitude and longitude to
visualize the locations of the different block groups. One of the first
things we noticed were two distinct clusters of highly priced houses, in
bright red, that appear to correspond to Los Angeles and the San
Francisco Bay Area. The lighter green dots in the plot illustrate the
lowest priced houses. Note that the color scale uses the median of
median_house_value as its midpoint, which helps to
illustrate concentrated areas of high prices.
## Map the data
#Source: https://stackoverflow.com/questions/65233613/plot-a-map-using-lat-and-long-with-r
my_sf <- st_as_sf(ca_housing_data, coords = c('longitude', 'latitude'))
my_sf <- st_set_crs(my_sf, value = 4326)
hs_value_plot <- ggplot(my_sf) +
geom_sf(aes(color = median_house_value)) +
labs( y = "Latitude",
x = "Longitude",
title = "Graphic Of Median House Value by Location") +
scale_colour_gradient2(low = green_color,
midpoint = median(ca_housing_data$median_house_value),
high = "#D55E00")
hs_value_plot
We also generated histograms for all relevant predictors. Please note that not all EDA is included in this report; additional EDA code can be found on our GitHub Repo for All Project Related Code. The linked section describes the relevant .rmd files that were out of scope for this report.
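As an illustration of the kind of histogram sweep we used, a minimal sketch follows. This is not the exact project code; it assumes the ca_housing_data data frame and the green_color constant defined earlier in this report.

```r
# Illustrative sketch: plot a histogram for every numeric column in
# ca_housing_data (assumed loaded above, with green_color defined).
numeric_cols = names(ca_housing_data)[sapply(ca_housing_data, is.numeric)]
par(mfrow = c(3, 3))
for (col in numeric_cols) {
  hist(ca_housing_data[[col]],
       main = paste("Histogram of", col),
       xlab = col,
       col = green_color)
}
par(mfrow = c(1, 1))
```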
The most important finding from the EDA was the long right tail of our
outcome variable median_house_value. As displayed in the
histogram below, a number of block groups sit at the top end of the
range.
ggplot(ca_housing_data, aes(median_house_value)) +
geom_histogram(fill = green_color, color= "#FFFFFF", bins = 15) +
labs( y = "Count of Block Groups",
x = "Median House Value in USD",
title = "Histogram of Median House Values")
Based on the histogram, we see that our response variable
could benefit from a log transformation. Dollar-denominated
variables are typically right-skewed, and
median_house_value is no different. This log transformation
will also help with the linear model assumptions during model
selection.
ggplot(ca_housing_data, aes(log(median_house_value))) +
geom_histogram(fill = green_color, color= "#FFFFFF", bins = 15) +
labs( y = "Count of Block Groups",
x = "Log of Median House Value in USD",
title = "Histogram of Log Median House Values")
In addition to looking at graphs and reviewing summary statistics, we
also began to identify variables with collinearity issues via a
correlation matrix and pairwise plot. In the pairwise plot we see that
total_bedrooms is highly collinear with
total_rooms, but total_rooms is not missing
any values. With this in mind, we can utilize the variable
total_rooms to preserve the 207 observations.
## pairs & Cor matrix
numeric_vals <- unlist(lapply(ca_housing_data, is.numeric), use.names = FALSE)
pairs(ca_housing_data[,numeric_vals], col=green_color)
round(cor(ca_housing_data[,numeric_vals]), 2)
## longitude latitude housing_median_age total_rooms
## longitude 1.00 -0.92 -0.11 0.04
## latitude -0.92 1.00 0.01 -0.04
## housing_median_age -0.11 0.01 1.00 -0.36
## total_rooms 0.04 -0.04 -0.36 1.00
## total_bedrooms NA NA NA NA
## population 0.10 -0.11 -0.30 0.86
## households 0.06 -0.07 -0.30 0.92
## median_income -0.02 -0.08 -0.12 0.20
## median_house_value -0.05 -0.14 0.11 0.13
## total_bedrooms population households median_income
## longitude NA 0.10 0.06 -0.02
## latitude NA -0.11 -0.07 -0.08
## housing_median_age NA -0.30 -0.30 -0.12
## total_rooms NA 0.86 0.92 0.20
## total_bedrooms 1 NA NA NA
## population NA 1.00 0.91 0.00
## households NA 0.91 1.00 0.01
## median_income NA 0.00 0.01 1.00
## median_house_value NA -0.02 0.07 0.69
## median_house_value
## longitude -0.05
## latitude -0.14
## housing_median_age 0.11
## total_rooms 0.13
## total_bedrooms NA
## population -0.02
## households 0.07
## median_income 0.69
## median_house_value 1.00
#Population and households are collinear
#Population and Total rooms are also collinear
#latitude and longitude are also highly correlated
#Remove total_bedrooms
ca_housing_data <- subset(ca_housing_data, select=-c(total_bedrooms))
We have another column, ocean_proximity, which is a
categorical variable indicating how close the block group is to the
ocean. Using latitude and longitude, we can see how this factor
variable helps to classify different census block groups. Keeping in
mind that we saw distinct clusters of highly priced median house values
in the Bay Area (see the blue "NEAR BAY" region) and LA (around 34N,
118W, in the "NEAR OCEAN" region), this factor along with latitude and
longitude will help us capture the geographic nature of housing value.
ggplot(my_sf) +
geom_sf(aes(color = ocean_proximity)) +
labs( y = "Longitude",
x = "Latitude",
title = "Ocean Proximity by Location")
In the table below, the reader can see this column has 5 different
categories. It is important to note that the ISLAND
category has only 5 observations. During initial analysis, our team
determined that ISLAND has a substantially higher average
median_house_value than the other categories, as
illustrated by the box plot below. With so few observations, this
category risks overfitting under an unlucky train-test split. We
therefore dropped the rows with the ISLAND category and
coerced the predictor ocean_proximity to be categorical.
#Ocean proximity - only 5 in island, and substantially different. I propose removing from consideration to avoid overfitting depending on an unlucky test/train split
table(ca_housing_data$ocean_proximity)
##
## <1H OCEAN INLAND ISLAND NEAR BAY NEAR OCEAN
## 9136 6551 5 2290 2658
boxplot(median_house_value ~ ocean_proximity,
ca_housing_data,
main = "Boxplot of Median House Value by Ocean Proximity",
ylab = "Median House value in USD",
xlab = "Ocean Proximity",
col = green_color)
#Remove the islands rows
ca_housing_data = ca_housing_data[ca_housing_data$ocean_proximity != "ISLAND",]
#Now we turn it into a factor after dropping that level
ca_housing_data$ocean_proximity <- as.factor(ca_housing_data$ocean_proximity)
At this point, we have concluded the prerequisite data cleaning based on EDA and can move on to model selection. Our final data set used for modelling has 20,635 rows and 9 columns. The summary statistics can be found below.
#Final data set and quality assurance
summary(ca_housing_data)
## longitude latitude housing_median_age total_rooms
## Min. :-124 Min. :32.5 Min. : 1.0 Min. : 2
## 1st Qu.:-122 1st Qu.:33.9 1st Qu.:18.0 1st Qu.: 1448
## Median :-118 Median :34.3 Median :29.0 Median : 2127
## Mean :-120 Mean :35.6 Mean :28.6 Mean : 2636
## 3rd Qu.:-118 3rd Qu.:37.7 3rd Qu.:37.0 3rd Qu.: 3148
## Max. :-114 Max. :42.0 Max. :52.0 Max. :39320
## population households median_income median_house_value
## Min. : 3 Min. : 1 Min. : 0.50 Min. : 14999
## 1st Qu.: 787 1st Qu.: 280 1st Qu.: 2.56 1st Qu.:119600
## Median : 1166 Median : 409 Median : 3.54 Median :179700
## Mean : 1426 Mean : 500 Mean : 3.87 Mean :206814
## 3rd Qu.: 1725 3rd Qu.: 605 3rd Qu.: 4.74 3rd Qu.:264700
## Max. :35682 Max. :6082 Max. :15.00 Max. :500001
## ocean_proximity
## <1H OCEAN :9136
## INLAND :6551
## NEAR BAY :2290
## NEAR OCEAN:2658
##
##
str(ca_housing_data)
## 'data.frame': 20635 obs. of 9 variables:
## $ longitude : num -122 -122 -122 -122 -122 ...
## $ latitude : num 37.9 37.9 37.9 37.9 37.9 ...
## $ housing_median_age: num 41 21 52 52 52 52 52 52 42 52 ...
## $ total_rooms : num 880 7099 1467 1274 1627 ...
## $ population : num 322 2401 496 558 565 ...
## $ households : num 126 1138 177 219 259 ...
## $ median_income : num 8.33 8.3 7.26 5.64 3.85 ...
## $ median_house_value: num 452600 358500 352100 341300 342200 ...
## $ ocean_proximity : Factor w/ 4 levels "<1H OCEAN","INLAND",..: 3 3 3 3 3 3 3 3 3 3 ...
#QA: assert no missing values
if(any(colSums(is.na(ca_housing_data)) > 0)){
stop("Fatal error. Missing data in one or more columns")
}
#QA: Assert that there are only 4 levels in ocean_proximity
if(length(levels(ca_housing_data$ocean_proximity)) != 4){
stop("Fatal error. There are not 4 levels in ocean_proximity")
}
After cleaning the data, we proceeded with building a model. The very first step we took was to divide the data into train and test sets. Based on the total number of observations in the data set, the team decided to use a 70%-30% train-test split.
We first initialized global variables and created helper functions (which are listed above in Helper Functions).
# build different models to test the performance
n_models = 9
# create a vector list of models to pass to various functions
cahoudta_models = vector("list", length = n_models)
cahoudta_model_names = vector("character", length = n_models)
# Train-Test Split Ratio
tr_pct = 0.70
To make the results reproducible, we set a seed to ensure that the train-test split of our data was consistent between team members. After creating both the train and test data sets, we began building models.
set.seed(420072022)
# Randomly select train indexes
cahoudta_trn_idx = sample(nrow(ca_housing_data),
size = trunc(tr_pct * nrow(ca_housing_data)))
# Split into train and test data sets based on the index
ca_housing_data.tr = ca_housing_data[cahoudta_trn_idx, ]
ca_housing_data.te = ca_housing_data[-cahoudta_trn_idx, ]
n_tr_rows = nrow(ca_housing_data.tr)
n_te_rows = nrow(ca_housing_data.te)
In the process of model selection, we built a total of 9 different models. While iteratively analyzing different model predictors and transformations, our team identified a number of outlying observations that caused issues with model training. So we circled back to the very first full additive model and identified these outliers using Cook's distance.
cahoudta_full_add_model_cd = lm(log(median_house_value) ~ ., data = ca_housing_data.tr)
cd_full_add_mod = cooks.distance(cahoudta_full_add_model_cd)
(count_outliers = sum(cd_full_add_mod > 4 / length(cd_full_add_mod)))
## [1] 841
percent_outliers = count_outliers / nrow(ca_housing_data.tr)
There were a total of 841 observations deemed influential based on the heuristic \(D_i > \frac{4}{n}\), where \(D_i\) is the Cook's distance of the \(i\)th observation and \(n\) is the number of observations. Given that these outliers represent only about 6% of the training data, our team decided to drop them from the training data set.
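The 6% figure can be verified with quick arithmetic, using only the counts reported above:

```r
# Sanity check of the outlier share quoted above.
n_trn = trunc(0.70 * 20635)   # 70% of the 20,635 cleaned rows = 14,444
4 / n_trn                     # the Cook's distance cutoff used above
round(100 * 841 / n_trn, 1)   # 5.8, i.e. roughly 6% of the training data
```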
When we re-fit the full additive model while ignoring these 841 outliers, we found that this model ultimately supplied the best test RMSE.
cahoudta_full_add_model = lm(log(median_house_value) ~ .,
data = ca_housing_data.tr,
subset = (cd_full_add_mod <= 4 / length(cd_full_add_mod)))
After making this adjustment, we investigated multicollinearity in the fitted full additive model. Given that this model includes all variables, we wanted to ascertain whether we would be introducing issues by using highly collinear variables. To evaluate this empirically, we calculated the variance inflation factor (VIF) of the predictors and printed those with a VIF greater than 5.
high_vif_predictors = vif(cahoudta_full_add_model) > 5
high_vif_pred_vals = vif(cahoudta_full_add_model)[high_vif_predictors]
sort(high_vif_pred_vals, decreasing = TRUE)
## latitude longitude households total_rooms population
## 22.32 19.84 13.55 11.23 7.35
Based on the above VIF values, we concluded the predictors
households and latitude could be dropped.
However, we ultimately found these predictors were consistently selected
in procedures that we used to optimize the model. Keeping these
collinear variables in the final selected model is a limitation of our
final model.
We also re-ran the pairs() plot with a new column
for the logged response value to keep the collinearity in view.
The secondary reason for re-running this plot during model selection was
to generate ideas about the relationships between the predictors and the
response. For instance, if we had seen a clearly quadratic relationship,
we could have used a polynomial term to improve the fit.
pairs(subset(cbind(ca_housing_data.tr, log(ca_housing_data.tr$median_house_value)),
select = -ocean_proximity),
col = green_color)
When we reviewed the pairs plot against the response variable
log(median_house_value), we made the following
observations:

- median_income, households, and total_rooms show some polynomial relationship with the response.
- housing_median_age has a lot of variance and needs a variable transformation.
- Either latitude or longitude could be eliminated due to their high pairwise collinearity.

At this point, we return to model building, with the above
hypotheses coming into play in later models. In order to ensure that we
selected the model that minimized test RMSE without sacrificing
interpretability, we decided to conduct a backward search using BIC. We
chose BIC over AIC because BIC penalizes larger models more
heavily. Using the step() function for our
BIC backward selection with the full additive model as our base, we
found that the resulting model removed the predictor
total_rooms. The implication is that removing
total_rooms yields a smaller model with better
predictive power. Since the step-wise search stopped after removing a
single variable, the algorithm found that removing any additional
variables would result in an unacceptable increase in the BIC.
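For reference, the two criteria differ only in their penalty term. For a linear model with \(k\) coefficients fit to \(n\) observations (standard definitions, up to additive constants):

\[
\text{AIC} = n \log\!\left(\tfrac{\text{RSS}}{n}\right) + 2k,
\qquad
\text{BIC} = n \log\!\left(\tfrac{\text{RSS}}{n}\right) + k \log(n)
\]

Since \(\log(n) > 2\) whenever \(n > e^2 \approx 7.4\), BIC charges more for every extra coefficient on a training set of our size; supplying k = log(n_tr_rows) to step() is what switches the search criterion from the default AIC to BIC.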
cahoudta_add_bckwd_bic_model = step(cahoudta_full_add_model,
direction = "backward",
k = log(n_tr_rows),
trace = 0)
Our next attempt was to create a model with two-way interactions between all predictors. Our thinking was that we could use the full interaction model as a basis for another backward search. Given the number of terms in the two-way interaction model, this model sacrifices interpretability for the sake of a lower train RMSE and a higher \(R^2\).
cahoudta_2int_add_model = lm(log(median_house_value) ~ . ^ 2,
data = ca_housing_data.tr)
ncoefs_2int_add_model <- length(coef(cahoudta_2int_add_model))
We next tried a backward BIC search starting from the full two-way interaction model.
cahoudta_2int_bckwd_bic_model = step(cahoudta_2int_add_model,
direction = "backward",
k = log(n_tr_rows),
trace = 0)
ncoefs_2int_backward_bic_model <- length(coef(cahoudta_2int_bckwd_bic_model))
The resulting backward BIC model has 44 coefficients, compared to the 53 with which the full interaction model started. Although this decrease is an improvement, interpreting the model will still be very difficult.
Our next set of models utilized our hypotheses about removing
collinear variables to improve model performance. We started by dropping
the predictors households and latitude due to
their high VIF. The result was an additive model composed of only
low-VIF predictors. However, the adjusted \(R^2\) of this model dropped
by about 0.1 relative to the previous models.
predictor_variables = colnames(ca_housing_data.tr)[-which(colnames(ca_housing_data.tr)
== "median_house_value")]
drop_cols = which(colnames(ca_housing_data.tr) == "median_house_value" |
colnames(ca_housing_data.tr) == "households" |
colnames(ca_housing_data.tr) == "latitude")
smaller_predictors = colnames(ca_housing_data.tr)[-drop_cols]
less_preds = paste0("log(median_house_value) ~ ",
paste(smaller_predictors, collapse = " + "))
cahoudta_lowvif_add_mod = lm(formula = less_preds, data = ca_housing_data.tr)
Due to the drop in the Adjusted \(R^2\), we tried a model that added two-way interactions for the low VIF variables. Unfortunately, this did not substantially improve the \(R^2\) or train RMSE.
less_preds_2int = paste0("log(median_house_value) ~ (",
paste(smaller_predictors, collapse = " + "), ") ^ 2")
cahoudta_lowvif_2int_mod = lm(formula = less_preds_2int, data = ca_housing_data.tr)
We still held out hope for the low-VIF predictors as the source of a
good model, so we tried adding 2nd-degree polynomial terms to see if
the model would improve. Note that in creating this model we had to
exclude the categorical predictor ocean_proximity from the
polynomial formula. Sadly, the adjusted \(R^2\) improved by less than
0.01 for this addition in model complexity.
drop_cols = which(colnames(ca_housing_data.tr) == "median_house_value" |
colnames(ca_housing_data.tr) == "households" |
colnames(ca_housing_data.tr) == "latitude")
reduced_predictors = colnames(ca_housing_data.tr)[-drop_cols]
less_preds = paste0("log(median_house_value) ~ ", paste(smaller_predictors,
collapse = " + "))
further_reduced_predictors = reduced_predictors[-which(reduced_predictors
== "ocean_proximity")]
less_poly_preds = paste0(" + I(", paste(further_reduced_predictors,
collapse = " ^ 2) + I("), " ^ 2)")
poly_formula = paste0(less_preds, less_poly_preds)
message("Polynomial Formula: ", poly_formula)
## Polynomial Formula: log(median_house_value) ~ longitude + housing_median_age + total_rooms + population + median_income + ocean_proximity + I(longitude ^ 2) + I(housing_median_age ^ 2) + I(total_rooms ^ 2) + I(population ^ 2) + I(median_income ^ 2)
cahoudta_lowvif_2poly_mod = lm(formula = poly_formula, data = ca_housing_data.tr)
After having little to no success with removing the high-VIF
predictors, we decided to reinstate one of the collinear variables,
latitude. This model, with both longitude and
latitude but without households, lacked the
performance required to justify its selection. Specifically, its
adjusted \(R^2\) was about 0.1 lower than that of the full additive
model. Additionally, its train RMSE was 0.07 higher, in the range of
the other models that had dropped both latitude and
households.
drop_cols = which(colnames(ca_housing_data.tr) == "median_house_value" |
colnames(ca_housing_data.tr) == "households")
lesser_predictors = colnames(ca_housing_data.tr)[-drop_cols]
less_preds = paste0("log(median_house_value) ~ ",
paste(lesser_predictors,
collapse = " + "))
message("Model 8: Formula: ", less_preds)
## Model 8: Formula: log(median_house_value) ~ longitude + latitude + housing_median_age + total_rooms + population + median_income + ocean_proximity
cahoudta_lola_add_mod = lm(formula = less_preds, data = ca_housing_data.tr)
Our final step was to attempt variable transformations to better
capture the variance in the response variable. We normalized the
predictors population and total_rooms by the
predictor households. We thought the derived variables
rooms_per_hh and pop_per_hh would add predictive power to the model
for the following reasons:

- rooms_per_hh: houses with a higher number of rooms per household in a block group tend to be larger, giving a positive correlation with median house prices.
- pop_per_hh: block groups with bigger houses will attract larger families (population per household).
ca_housing_data.tr$pop_per_hh = (ca_housing_data.tr$population /
ca_housing_data.tr$households)
ca_housing_data.tr$rooms_per_hh = (ca_housing_data.tr$total_rooms /
ca_housing_data.tr$households)
ca_housing_data.te$pop_per_hh = (ca_housing_data.te$population /
ca_housing_data.te$households)
ca_housing_data.te$rooms_per_hh = (ca_housing_data.te$total_rooms /
ca_housing_data.te$households)
We fit an additive model with the transformed variables and predictors without high variance inflation factors or collinearity. While the train RMSE and adjusted \(R^2\) suggested a strong fit on the training data, the test RMSE revealed that this model was significantly overfit: the test RMSE of the model with transformed variables was nearly 5 times that of the other models. Our team learned a great deal about how some variable transformations can produce a good training fit based on observations of the data but wreak havoc when extrapolated to unseen data.
cahoudta_2int_txfd_model_cd = lm(log(median_house_value) ~
(longitude +
housing_median_age +
median_income +
pop_per_hh +
rooms_per_hh +
ocean_proximity ) ^ 2,
data = ca_housing_data.tr)
cd_2int_txfd = cooks.distance(cahoudta_2int_txfd_model_cd)
cahoudta_2int_txfd_model = lm(log(median_house_value) ~
(longitude +
housing_median_age +
median_income +
pop_per_hh +
rooms_per_hh +
ocean_proximity ) ^ 2,
data = ca_housing_data.tr,
subset = (cd_2int_txfd <= 4 / length(cd_2int_txfd)))
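The `4 / n` cutoff used above is a common rule of thumb for flagging unusually influential observations. A minimal, self-contained illustration of the same pattern on a built-in data set (the `mtcars` model is illustrative only, not part of our analysis):

```r
# Influence heuristic: observations with Cook's distance above 4 / n are
# treated as unusually influential and excluded before refitting.
fit = lm(mpg ~ wt + hp, data = mtcars)
cd = cooks.distance(fit)
keep = cd <= 4 / length(cd)
sum(!keep)  # number of flagged observations
refit = lm(mpg ~ wt + hp, data = mtcars, subset = keep)
```

The refit discards only the flagged rows; comparing `coef(fit)` with `coef(refit)` shows how much those points moved the estimates.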
After creating the 9th and final model, we recognized we were getting diminishing returns from additional iterations. At this point, we compiled our evaluation criteria into a chart and discussed as a group which model to select.
We then calculated and displayed the performance metrics for all the models.
cahoudta_models[[1]] = cahoudta_full_add_model
cahoudta_model_names[[1]] = "cahoudta_full_add_model"
cahoudta_models[[2]] = cahoudta_add_bckwd_bic_model
cahoudta_model_names[[2]] = "cahoudta_add_bckwd_bic_model"
cahoudta_models[[3]] = cahoudta_2int_add_model
cahoudta_model_names[[3]] = "cahoudta_2int_add_model"
cahoudta_models[[4]] = cahoudta_2int_bckwd_bic_model
cahoudta_model_names[[4]] = "cahoudta_2int_bckwd_bic_model"
cahoudta_models[[5]] = cahoudta_lowvif_add_mod
cahoudta_model_names[[5]] = "cahoudta_lowvif_add_mod"
cahoudta_models[[6]] = cahoudta_lowvif_2int_mod
cahoudta_model_names[[6]] = "cahoudta_lowvif_2int_mod"
cahoudta_models[[7]] = cahoudta_lowvif_2poly_mod
cahoudta_model_names[[7]] = "cahoudta_lowvif_2poly_mod"
cahoudta_models[[8]] = cahoudta_lola_add_mod
cahoudta_model_names[[8]] = "cahoudta_lola_add_mod"
cahoudta_models[[9]] = cahoudta_2int_txfd_model
cahoudta_model_names[[9]] = "cahoudta_2int_txfd_model"
# Print comparative model performance (test RMSE computed on ca_housing_data.te)
model_performance_metrics(cahoudta_models,
cahoudta_model_names,
ca_housing_data.te)
| Adjusted R-Squared | Train RMSE | Test RMSE | LOOCV RMSE | Coefficients | |
|---|---|---|---|---|---|
| cahoudta_full_add_model | 0.7597 | 0.2687 | 0.3407 | 0.2690 | 11 |
| cahoudta_add_bckwd_bic_model | 0.7596 | 0.2688 | 0.3407 | 0.2690 | 10 |
| cahoudta_2int_add_model | 0.7150 | 0.3038 | 0.3090 | 0.3069 | 53 |
| cahoudta_2int_bckwd_bic_model | 0.7144 | 0.3043 | 0.3094 | 0.3065 | 44 |
| cahoudta_lowvif_add_mod | 0.6310 | 0.3463 | 0.3456 | 0.3465 | 9 |
| cahoudta_lowvif_2int_mod | 0.6645 | 0.3299 | 0.3321 | 0.3311 | 34 |
| cahoudta_lowvif_2poly_mod | 0.6667 | 0.3290 | 0.3335 | 0.3321 | 14 |
| cahoudta_lola_add_mod | 0.6515 | 0.3365 | 0.3379 | 0.3368 | 10 |
| cahoudta_2int_txfd_model | 0.7702 | 0.2635 | 1.6615 | 0.2664 | 34 |
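The LOOCV RMSE column above can be reproduced without refitting the model n times: for a linear model, each leave-one-out residual equals the ordinary residual divided by \(1 - h_{ii}\), where \(h_{ii}\) is the leverage. A minimal sketch of this shortcut (the helper name and the `mtcars` example are illustrative, not from our analysis):

```r
# LOOCV RMSE for a linear model via the hat-value (leverage) shortcut:
# each leave-one-out residual equals e_i / (1 - h_ii), so no refitting
# is required.
loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}

# Example on a built-in data set:
fit = lm(mpg ~ wt + hp, data = mtcars)
loocv_rmse(fit)
```

This identity (the PRESS statistic) is exact for ordinary least squares, which is why LOOCV RMSE is cheap to report for every candidate model.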
# Final selected model
final_selected_model <- cahoudta_add_bckwd_bic_model
As mentioned throughout the narrative, our first model
cahoudta_full_add_model performed surprisingly well at the
outset and became the model to beat. However, as the results above show,
the backward BIC search did reduce the number of \(\beta\) parameters by 1. Both the
full model cahoudta_full_add_model and the backward stepwise
search model cahoudta_add_bckwd_bic_model had similar
train RMSE, test RMSE, and adjusted \(R^2\). Since the model
cahoudta_add_bckwd_bic_model is smaller, it was a strong
contender for selection after only two models had been created, but we
also wanted to remove collinear variables, explore variable
transformations, and leverage interaction terms.
When we tried to reduce the variables manually by removing both
latitude and longitude, we found the
reduction in adjusted \(R^2\) and the
increase in train RMSE unacceptable. More complex models with
two-way interactions and polynomial terms did not provide a significant
payoff in adjusted \(R^2\) or
train RMSE. Finally, the variable transformation model
appeared to fit the training data very well in both
adjusted \(R^2\) and train RMSE.
Our group had decided at the outset that the minimization of test
RMSE, an interpretable model, and a reasonable adjusted \(R^2\) were the essential trade-offs when
selecting a model. Looking at test RMSE in particular, we immediately
eliminated the possibility of using the transformed variable model. The
models with the lowest test RMSE had over 40 coefficients, which would
make analysis challenging. Ultimately, the model
cahoudta_add_bckwd_bic_model had the best combination of
adjusted \(R^2\), low enough test RMSE,
and a reasonable number of coefficients. Below, we will discuss the
selected model further.
During the model building process, we explored 9
different models for our housing data set. Recall that the main goal of
this project was to effectively model the California Housing Data and
identify the predictors that help explain the
median_house_value response variable. Based on the Comparative Model
Performance Metrics table, we selected the model
cahoudta_add_bckwd_bic_model as our best model for the
purposes of this project.
The final selected model has the following features. The table below
lists all the predictors for the model
cahoudta_add_bckwd_bic_model with their test statistics and
p-values.
coef_matrix <- data.frame(coef(summary(final_selected_model)))
names(coef_matrix) <- c("Estimate", "Standard Error", "t Value", "P Value")
kable(coef_matrix, digits = 5, caption = "Table of Predictors and Test Statistics")
| Estimate | Standard Error | t Value | P Value | |
|---|---|---|---|---|
| (Intercept) | -2.44607 | 0.43904 | -5.571 | 0.00000 |
| longitude | -0.16094 | 0.00513 | -31.386 | 0.00000 |
| latitude | -0.15245 | 0.00514 | -29.637 | 0.00000 |
| housing_median_age | 0.00301 | 0.00021 | 14.057 | 0.00000 |
| population | -0.00028 | 0.00001 | -45.759 | 0.00000 |
| households | 0.00092 | 0.00002 | 51.446 | 0.00000 |
| median_income | 0.17860 | 0.00142 | 126.148 | 0.00000 |
| ocean_proximityINLAND | -0.33330 | 0.00864 | -38.582 | 0.00000 |
| ocean_proximityNEAR BAY | -0.08728 | 0.00914 | -9.551 | 0.00000 |
| ocean_proximityNEAR OCEAN | -0.02969 | 0.00772 | -3.845 | 0.00012 |
We also ran all the standard diagnostics on this model.
par(mfrow=c(2, 2))
plot(final_selected_model,
pch = 20,
col = green_color)
We also ran model diagnostic tests to check the linear regression
assumptions (LINE) with the help of the Breusch-Pagan Test and
the Shapiro-Wilk Test. For every model, the decision of the
Breusch-Pagan Test was to reject the
null hypothesis, meaning the constant variance assumption is
violated (i.e., the errors are heteroscedastic). Similarly, the
Shapiro-Wilk Test indicated a violation of the
normality-of-errors assumption. For all the models, the p-values of
both tests were far below the significance
level of 0.05.
get_model_diagnostic_tests(cahoudta_models, cahoudta_model_names)
| B-P Test P-Value | B-P Test Decision | Shapiro-Wilk Test P-Value | Shapiro-Wilk Test Decision | |
|---|---|---|---|---|
| cahoudta_full_add_model | <2e-16 | Reject | 1.7e-08 | Reject |
| cahoudta_add_bckwd_bic_model | <2e-16 | Reject | 1.7e-08 | Reject |
| cahoudta_2int_add_model | <2e-16 | Reject | <2e-16 | Reject |
| cahoudta_2int_bckwd_bic_model | <2e-16 | Reject | <2e-16 | Reject |
| cahoudta_lowvif_add_mod | <2e-16 | Reject | <2e-16 | Reject |
| cahoudta_lowvif_2int_mod | <2e-16 | Reject | <2e-16 | Reject |
| cahoudta_lowvif_2poly_mod | <2e-16 | Reject | <2e-16 | Reject |
| cahoudta_lola_add_mod | <2e-16 | Reject | <2e-16 | Reject |
| cahoudta_2int_txfd_model | <2e-16 | Reject | 1.1e-11 | Reject |
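The two tests in the table above can be sketched in base R. The studentized Breusch-Pagan statistic is \(n \cdot R^2\) from regressing the squared residuals on the original predictors; our helper `get_model_diagnostic_tests` presumably wraps equivalent calls (the helper name `bp_test_pvalue` and the `mtcars` model below are illustrative only):

```r
# Studentized Breusch-Pagan test, implemented manually in base R:
# regress the squared residuals on the predictors, then compare
# n * R^2 of that auxiliary regression to a chi-squared distribution.
bp_test_pvalue = function(model) {
  predictors = model.frame(model)[, -1, drop = FALSE]  # drop the response
  aux = lm(resid(model) ^ 2 ~ ., data = predictors)
  stat = nobs(model) * summary(aux)$r.squared
  df = length(coef(aux)) - 1
  pchisq(stat, df = df, lower.tail = FALSE)
}

fit = lm(mpg ~ wt + hp, data = mtcars)
bp_test_pvalue(fit)               # Breusch-Pagan p-value
shapiro.test(resid(fit))$p.value  # Shapiro-Wilk p-value (requires n <= 5000)
```

Note that `shapiro.test` accepts at most 5000 observations, so on a training set of our size the residuals would need to be sampled first.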
We show the model diagnostic plots for our top 2 models: 1)
cahoudta_full_add_model and 2)
cahoudta_add_bckwd_bic_model. As can be seen, the log
transformation of the response mitigated the issues with the
heteroscedasticity and normality-of-errors assumptions. However, the
plots below show that both of these assumptions are still
violated.
# No of models for comparative charts
n_top_models = 2
# create a vector list of models to pass to various functions
top_cahoudta_models = vector("list", length = n_top_models)
top_cahoudta_model_names = vector("character", length = n_top_models)
top_cahoudta_models[[1]] = cahoudta_full_add_model
top_cahoudta_model_names[[1]] = "cahoudta_full_add_model"
top_cahoudta_models[[2]] = cahoudta_add_bckwd_bic_model
top_cahoudta_model_names[[2]] = "cahoudta_add_bckwd_bic_model"
get_model_diagnostic_plots(top_cahoudta_models, top_cahoudta_model_names)
We selected the model obtained from backward selection, starting from
the full additive model, for modeling median_house_value. Our
model provides utility in explaining how
features of a California Block Group relate to the log of the median
house value of that block group. In this section, we will discuss how
this model is useful and interpret the model’s \(\hat{\beta}\) parameters.
As mentioned above in Limitations of the Data, our data
are from the 1990 Census, which means we are not using this model for
prediction. Now that we have developed the best model with the available
predictors, we can use the model to explain the relationship of
geography to the median_house_value of a block group. The
following section is useful for understanding the historical context of
the current California real estate market. We can also use this model to
ask questions about the directionality of the impact of the different
predictors. In other words, our model allows us to answer questions like
“do higher income block groups have higher average
median_house_values?”
There are several things to notice about the below table of \(\hat{\beta}\)’s when discussing the model.
The first is that, since we log-transformed the
response, we need to exponentiate the \(\hat{\beta}\)’s to correctly interpret the
multiplicative impact of each of these parameters on
median_house_value. For ease of interpretability, we
exponentiated the coefficients, subtracted 1, and multiplied by 100, so
that a negative value reads as a percent decrease in the average
median_house_value and a positive value reads as a percent
increase in the average
median_house_value.
coefs_exponentiated <- data.frame("Betas Exponentiated" = (exp(coef(final_selected_model)) - 1) * 100)
kable(coefs_exponentiated)
| Betas.Exponentiated | |
|---|---|
| (Intercept) | -91.3367 |
| longitude | -14.8658 |
| latitude | -14.1402 |
| housing_median_age | 0.3014 |
| population | -0.0284 |
| households | 0.0925 |
| median_income | 19.5547 |
| ocean_proximityINLAND | -28.3445 |
| ocean_proximityNEAR BAY | -8.3578 |
| ocean_proximityNEAR OCEAN | -2.9256 |
Next, we discuss the role geography played in
median_house_value. For the purposes of this discussion, we
once again include the graph of median_house_value by
latitude and longitude, so the reader can review the model’s
findings against the geographical spread of the response variable.
hs_value_plot
Please note that our base level for ocean_proximity is
“<1H OCEAN”. The implication is that the average
median_house_value of a block group decreases for any other
ocean_proximity. This is not surprising based on the above
graphic where we see discrete clusters of high value houses in San
Francisco and Los Angeles. The model found that blocks classified as
“INLAND” had a 28.3445% decrease in average
median_house_value relative to the blocks classified as
“<1H OCEAN”.
The other two geographical variables, latitude and longitude, also
have interesting interpretations, consistent with our prior expectation
that geography played a major role in 1990 CA block group house value.
Specifically, for every 1 unit increase in latitude (moving northward),
we see a 14.1402% decrease in median_house_value.
Similarly, for every 1 unit increase in longitude (moving eastward,
away from the coast), we see
a 14.8658% decrease in median_house_value.
The other key \(\hat{\beta}\) to
call attention to is that of median_income. For every 1
unit increase in median income, we see a 19.5547% increase in
median_house_value. In other words, the model found that
block groups with higher incomes tend to have higher values of
median_house_value.
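As a quick arithmetic sanity check, the log-scale estimate for median_income can be converted to a percent change by hand (the value 0.17860 is copied from the coefficient table above):

```r
# Converting the log-scale estimate for median_income into a percent
# change in median_house_value, mirroring the exponentiation step above.
beta_income = 0.17860
pct_change = (exp(beta_income) - 1) * 100
round(pct_change, 4)
# approximately 19.55, consistent with the 19.5547% reported in the
# table (the tiny gap comes from rounding beta to 5 digits)
```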
It is also important to underscore future directions for analyses of this data set that our group was not able to implement due to time constraints or the scope of the STAT 420 material. This section is meant to emphasize the careful consideration our team gave to ideating about the best way to model the data.
Our team discussed the potential of extracting meaningful Geo-Spatial insights and predictor variables from the latitude and longitude data. The hypothesis is that houses within a 50-mile radius of large California cities would likely have higher median prices. We considered leveraging Geo-Spatial libraries for this purpose; for example, the Haversine formula calculates the distance between two points on a sphere from their latitudes and longitudes. Another alternative considered was simply using Euclidean distance. In the interest of time, we could not create these variables, but we believe a future direction with this data is to include such Geo-Spatial variables in the modeling.
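A minimal sketch of the Haversine idea described above. The function name and the reference point (downtown Los Angeles) are illustrative assumptions; a production version might instead use a Geo-Spatial package:

```r
# Haversine great-circle distance in miles between two (lat, lon) pairs,
# with Earth's radius approximated as 3959 miles.
haversine_miles = function(lat1, lon1, lat2, lon2, r = 3959) {
  to_rad = pi / 180
  dlat = (lat2 - lat1) * to_rad
  dlon = (lon2 - lon1) * to_rad
  a = sin(dlat / 2) ^ 2 +
      cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2) ^ 2
  2 * r * asin(sqrt(a))
}

# Hypothetical derived predictor: distance of each block group to LA.
# ca_housing_data.tr$dist_to_la = haversine_miles(
#   ca_housing_data.tr$latitude, ca_housing_data.tr$longitude,
#   34.05, -118.24)
```

The function is vectorized over its inputs, so a distance column for every block group would be a single call.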
Looking at the response variable histogram, we found that median house prices are capped at the top. We would like to further explore the impact of this cap on model fit, possibly by truncating observations with very high median house prices; because of the cap, the model's accuracy in this price range would be poor anyway.
Our last model, cahoudta_2int_txfd_model, has the highest
adjusted \(R^2\) on the training data, although it suffers from
overfitting. We would like to investigate this further with the
objective of mitigating the overfitting without sacrificing adjusted
\(R^2\) performance.
The constant variance assumption is still violated per the BP Test results and the Fitted vs. Residuals plot. In the future, we would like to further mitigate this issue by trying additional transformations of the variables, for example guided by a Box-Cox analysis. Similarly, the Q-Q plots for the top two models look much improved compared to when we started modeling with the raw data. Time permitting, we would like to further investigate the normality-of-errors assumption by trying more transformations of the predictor variables.
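A sketch of what the Box-Cox diagnostic mentioned above looks like in practice, assuming the MASS package. The simulated data here stand in for the housing training set and are not part of our analysis:

```r
# Box-Cox profile likelihood for the response transformation.
library(MASS)

set.seed(420)
x = runif(500, 1, 10)
y = exp(0.5 + 0.3 * x + rnorm(500, sd = 0.2))  # log-linear response

fit = lm(y ~ x)
bc = boxcox(fit, lambda = seq(-1, 1, 0.05), plotit = FALSE)
lambda_best = bc$x[which.max(bc$y)]
lambda_best
# A lambda near 0 supports the log transformation we already applied to
# the response; a lambda near 1 would suggest leaving it untransformed.
```

Running the same diagnostic on the untransformed `median_house_value` would confirm (or challenge) our choice of `log(median_house_value)` as the response.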
Below we have included additional code used for model selection and a link to our GitHub repo.